Annotating Resources for Information Extraction
نویسندگان
چکیده
Trained systems for NE extraction have shown significant promise because of their robustness to errorful input and rapid adaptability. However, these learning algorithms have transferred the cost of development from skilled computational linguistic expertise to data annotation, putting a new premium on effective ways to produce high-quality annotated resources at minimal cost. The paper reflects on BBN’s four years of experience in the annotation of training data for Named Entity (NE) extraction systems discussing useful techniques for maximizing data quality and quantity.
منابع مشابه
Classifying articles in English and German Wikipedia
Named Entity (NE) information is critical for Information Extraction (IE) tasks. However, the cost of manually annotating sufficient data for training purposes, especially for multiple languages, is prohibitive, meaning automated methods for developing resources are crucial. We investigate the automatic generation of NE annotated data in German from Wikipedia. By incorporating structural featur...
متن کاملAnnotating Events and Temporal Information in Newswire Texts
Abstract If one is concerned with natural language processing applications such as information extraction (IE), which typically involve extracting information about temporally situated scenarios, the ability to accurately position key events in time is of great importance. To date only minimal work has been done in the IE community concerning the extraction of temporal information from text, an...
متن کاملGuidelines for Annotating Temporal Information
This paper introduces a set of guidelines for annotating time expressions with a canonicalized representation of the times they refer to. Applications that can benefit from such an annotated corpus include information extraction (e.g., normalizing temporal references for database entry), question answering (answering “when” questions), summarization (temporally ordering information), machine tr...
متن کاملImplementation and Evaluation of a Negation Tagger in a Pipeline-based System for Information Extraction from Pathology Reports
We have developed a pipeline-based system for automated annotation of Surgical Pathology Reports with UMLS terms that builds on GATE--an open-source architecture for language engineering. The system includes a module for detecting and annotating negated concepts, which implements the NegEx algorithm--an algorithm originally described for use in discharge summaries and radiology reports. We desc...
متن کاملAnnotating and Recognizing Event Modality in Text
Current results in basic Information Extraction tasks such as Named Entity Recognition or Event Extraction suggest that we are close to achieving a stage where the fundamental units for text understanding are put together; namely, predicates and their arguments. However, other layers of information, such as event modality, are essential for understanding, since the inferences derivable from fac...
متن کامل